Linear Regression (Term project)

Home Sweet Home - Predicts the rental price of accommodation

Table of Contents¶


1. Introduction
2. Problem Statement
3. Installing & Importing Libraries

  • 3.1 Installing Libraries
  • 3.2 Upgrading Libraries
  • 3.3 Importing Libraries

4. Data Acquisition & Description

  • 4.1 Data Description
  • 4.2 Data Information

5. Data Pre-processing

  • 5.1 Pre-Profiling Report
  • 5.2 Post-Profiling Report

6. Exploratory Data Analysis
7. Data Post-Processing

  • 7.1 Feature Extraction
  • 7.2 Feature Transformation
  • 7.3 Feature Scaling
  • 7.4 Data Preparation

8. Model Development & Evaluation
9. Conclusion


1. Introduction¶


Home Sweet Home

Company Introduction

Your client for this project is an online marketplace for lodging, primarily homestays for vacation rentals, and tourism activities.

  • Home Sweet Home (HSH) allows hosts to rent their homestays to other people as guests.
  • The company acts as an intermediary between hosts and guests and has more than 80,000 hosts across 19 cities.
  • Their goal is to provide the best hospitality service to their customers in a more unique and personalized manner.

Current Scenario

  • The company is planning to introduce a new system that will help to easily monitor and predict the rental prices of homes across various cities.

2. Problem Statement¶


The current process suffers from the following problems:

  • The company monitors and validates the prices set by the hosts.
  • The process of validation is based on various factors such as city, neighborhood, neighborhood group, location on map, availability, and reviews.
  • Estimating the proper price from so many factors is time-consuming, resource-intensive, and sometimes inaccurate.

They have hired you as a data science consultant. They want to supplement their analysis and prediction with a more feasible and accurate approach.

Your Role

  • You are given a historical dataset that contains the price of rental homes and many factors that determine that price.
  • Your task is to build a regression model using the dataset.
  • Because there was no machine learning model for this problem in the company, you don’t have a quantifiable win condition. You need to build the best possible model.

Project Deliverables

  • Deliverable: Predict the rental price of accommodation.
  • Machine Learning Task: Regression
  • Target Variable: price
  • Win Condition: N/A (best possible model)

Evaluation Metric

  • The model evaluation will be based on the RMSE score.
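
As a minimal sketch of the metric, RMSE is the square root of the mean squared difference between actual and predicted values. The helper name `rmse` and the toy prices below are illustrative assumptions and do not come from the dataset.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Toy prices (made up for illustration only)
actual = [100.0, 150.0, 200.0]
predicted = [110.0, 140.0, 190.0]

# Every prediction is off by 10, so the RMSE is 10
print(rmse(actual, predicted))
```

Because the errors are squared before averaging, a few large misses raise RMSE more than many small ones.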

3. Installing & Importing Libraries¶


3.1 Installing Libraries¶

In [ ]:
# !pip install -q datascience                                                       # Package that is required by pandas profiling
# !pip install -q pandas-profiling  

3.2 Upgrading Libraries¶

In [ ]:
# !pip install -q --upgrade pandas-profiling

3.3 Importing Libraries¶

In [1]:
#-------------------------------------------------------------------------------------------------------------------------------
import pandas as pd                                                 # Importing for panel data analysis
from pandas_profiling import ProfileReport                          # Import Pandas Profiling (To generate Univariate Analysis) 
#-------------------------------------------------------------------------------------------------------------------------------
import numpy as np                                                  # Importing package numpy (For Numerical Python)
#-------------------------------------------------------------------------------------------------------------------------------
import plotly.express as px
import matplotlib.pyplot as plt                                     # Importing pyplot interface using matplotlib
import seaborn as sns                                               # Importing seaborn library for statistical visualization
%matplotlib inline
#-------------------------------------------------------------------------------------------------------------------------------
from sklearn.linear_model import LinearRegression                   # Importing Linear Regression model
from sklearn.metrics import mean_squared_error                      # To calculate the MSE of a regression model
from sklearn.metrics import mean_absolute_error                     # To calculate the MAE of a regression model
from sklearn.metrics import r2_score                                # To calculate the R-squared score of a regression model
from sklearn.model_selection import train_test_split                # To split the data in training and testing part
from sklearn.preprocessing import StandardScaler                    # Importing Standard Scaler library from preprocessing
from sklearn.preprocessing import LabelEncoder                      # Importing Label Encoder library from preprocessing
#-------------------------------------------------------------------------------------------------------------------------------
import folium                                                       # Importing folium package
from folium import Map, Marker                                      # Importing folium to plot locations on map
#-------------------------------------------------------------------------------------------------------------------------------
import warnings                                                     # Importing warnings to control warning messages
warnings.filterwarnings('ignore')                                   # Suppress all warnings

4. Data Acquisition & Description¶


In [2]:
HomeSweetHome = pd.read_csv('C:/Users/Mahesh/Downloads/Python/Term Projects/ML_Intermediate/rentel price/train_data.csv')
print('Data Shape:', HomeSweetHome.shape)
HomeSweetHome.head()
Data Shape: (137023, 17)
Out[2]:
id name host_id host_name neighbourhood_group neighbourhood latitude longitude room_type minimum_nights number_of_reviews last_review reviews_per_month calculated_host_listings_count availability_365 city price
0 149653 Private bedroom located in Downtown Manhattan 257599351.0 Sandra And Katharina Manhattan Chinatown 40.71703 -73.99538 Private room 2.0 17.0 18/09/19 1.04 1.0 0.0 New York City 100.0
1 74702 Quiet, Comfy West LA Cottage - HSR 19-000047 2882551.0 James City of Los Angeles Mar Vista 34.01257 -118.44254 Entire home/apt 2.0 331.0 07/09/20 4.28 1.0 170.0 Los Angeles 102.0
2 95858 Home away from Home ! 287662307.0 Kahee Other Cities Pasadena 34.14255 -118.09888 Entire home/apt 10.0 4.0 29/03/20 0.37 1.0 306.0 Los Angeles 131.0
3 61301 Kukui’ula Club Villa 11 198477445.0 Lodge Kauai Koloa-Poipu 21.88471 -159.48359 Entire home/apt 1.0 0.0 NaN NaN 19.0 358.0 Hawaii 2399.0
4 132101 One bedroom apartment 87835557.0 Kostas Queens Astoria 40.76623 -73.90911 Entire home/apt 2.0 189.0 31/08/20 3.82 1.0 331.0 New York City 76.0

4.1 Data Description¶

  • In this section we will get description and statistics about the data.

Dataset Feature Description

The Dataset contains the following columns:

Column Name                      Description
host_id                          unique host Id
host_name                        name of the host
neighbourhood_group              group in which the neighbourhood lies
neighbourhood                    name of the neighbourhood
latitude                         latitude of listing
longitude                        longitude of listing
room_type                        type of room
minimum_nights                   minimum no. of nights required to book
number_of_reviews                total number of reviews on the listing
last_review                      the date on which listing received its last review
reviews_per_month                average reviews per month on listing
calculated_host_listings_count   total number of listings by host
availability_365                 number of days in the year the listing is available for rent
city                             region of the listing
price                            price of listing per night
In [3]:
HomeSweetHome.columns
Out[3]:
Index(['id', 'name', 'host_id', 'host_name', 'neighbourhood_group',
       'neighbourhood', 'latitude', 'longitude', 'room_type', 'minimum_nights',
       'number_of_reviews', 'last_review', 'reviews_per_month',
       'calculated_host_listings_count', 'availability_365', 'city', 'price'],
      dtype='object')
In [4]:
HomeSweetHome.describe(include='all')
Out[4]:
id name host_id host_name neighbourhood_group neighbourhood latitude longitude room_type minimum_nights number_of_reviews last_review reviews_per_month calculated_host_listings_count availability_365 city price
count 137023.000000 137007 1.370230e+05 137002 80300 137023 137023.000000 137023.000000 137023 137023.000000 137023.000000 107232 107232.000000 137023.000000 137023.000000 137023 137023.000000
unique NaN 132769 NaN 21866 17 1134 NaN NaN 4 NaN NaN 2271 NaN NaN NaN 19 NaN
top NaN A place of your own | 2BR in Las Vegas NaN Michael Manhattan Unincorporated Areas NaN NaN Entire home/apt NaN NaN 15/03/20 NaN NaN NaN New York City NaN
freq NaN 44 NaN 1228 16096 5222 NaN NaN 93651 NaN NaN 1858 NaN NaN NaN 36519 NaN
mean 85580.723397 NaN 9.632567e+07 NaN NaN NaN 34.584207 -101.510148 NaN 10.446071 33.138831 NaN 1.387220 17.226174 163.602446 NaN 205.281792
std 49471.662411 NaN 1.007056e+08 NaN NaN NaN 7.063332 28.013884 NaN 25.593436 61.903532 NaN 1.650963 52.670175 140.766748 NaN 504.573579
min 1.000000 NaN 2.300000e+01 NaN NaN NaN 18.920990 -159.714900 NaN 1.000000 0.000000 NaN 0.010000 1.000000 0.000000 NaN 0.000000
25% 42729.500000 NaN 1.431064e+07 NaN NaN NaN 30.249505 -118.365710 NaN 1.000000 1.000000 NaN 0.220000 1.000000 1.000000 NaN 75.000000
50% 85640.000000 NaN 5.274681e+07 NaN NaN NaN 36.058990 -90.105520 NaN 3.000000 7.000000 NaN 0.780000 2.000000 151.000000 NaN 120.000000
75% 128374.500000 NaN 1.543914e+08 NaN NaN NaN 40.718400 -73.989320 NaN 7.000000 37.000000 NaN 2.000000 7.000000 316.000000 NaN 200.000000
max 171279.000000 NaN 3.679071e+08 NaN NaN NaN 45.617270 -70.995950 NaN 1250.000000 966.000000 NaN 44.060000 593.000000 365.000000 NaN 24999.000000

Observations:

  • minimum_nights ranges from 1 to 1250 nights.

  • price ranges from 0 to 24999; a price of 0 is not a plausible rental price.

  • 25% of the listings are priced at or below 75.

  • 50% of the listings are priced at or below 120 (the median).

  • 75% of the listings are priced at or below 200.
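
The quartiles above can be turned into rough outlier bounds. The sketch below applies the 1.5 × IQR rule, which is one common convention and an assumption here, not a rule stated in the notebook. The quartile values are copied from the describe() output.

```python
# Quartiles of price taken from the describe() output
q1, q3 = 75.0, 200.0

iqr = q3 - q1                      # interquartile range
lower_fence = q1 - 1.5 * iqr       # values below this would be flagged as outliers
upper_fence = q3 + 1.5 * iqr       # values above this would be flagged as outliers

print("IQR:", iqr)
print("Fences:", lower_fence, upper_fence)
```

The upper fence works out to 387.5, which is close to the cutoff of 400 that the notebook later applies to price.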

4.2 Data Information¶

  • In this section, we will get information about the data and see some observations.
In [5]:
HomeSweetHome.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 137023 entries, 0 to 137022
Data columns (total 17 columns):
 #   Column                          Non-Null Count   Dtype  
---  ------                          --------------   -----  
 0   id                              137023 non-null  int64  
 1   name                            137007 non-null  object 
 2   host_id                         137023 non-null  float64
 3   host_name                       137002 non-null  object 
 4   neighbourhood_group             80300 non-null   object 
 5   neighbourhood                   137023 non-null  object 
 6   latitude                        137023 non-null  float64
 7   longitude                       137023 non-null  float64
 8   room_type                       137023 non-null  object 
 9   minimum_nights                  137023 non-null  float64
 10  number_of_reviews               137023 non-null  float64
 11  last_review                     107232 non-null  object 
 12  reviews_per_month               107232 non-null  float64
 13  calculated_host_listings_count  137023 non-null  float64
 14  availability_365                137023 non-null  float64
 15  city                            137023 non-null  object 
 16  price                           137023 non-null  float64
dtypes: float64(9), int64(1), object(7)
memory usage: 17.8+ MB

Observations:

  • Of the 17 columns (16 features plus the target price), 1 is of int64 datatype (id), 7 are of object datatype (name, host_name, neighbourhood_group, neighbourhood, room_type, last_review, city), and the remaining 9 are of float64 datatype.

  • We may have to convert some variables like ('minimum_nights', 'number_of_reviews', 'last_review', 'reviews_per_month', 'calculated_host_listings_count', 'availability_365', 'price') into appropriate forms so we can use them for training purposes.


5. Data Pre-Processing¶


5.1 Pre Profiling Report¶

In [6]:
# profile = ProfileReport(HomeSweetHome, title="Rental Profiling Report")
# profile.to_file("Rental_report.html")
# print('Accomplished!')

Observations from Profile Report

Observations               Values
Number of columns          17
Number of rows             137023
Missing cells              116342
Duplicate rows             0
Continuous type columns    10
Categorical type columns   7

B. Variables with missing data:

Variable               Missing values
name                   16
host_name              21
neighbourhood_group    56723
last_review            29791
reviews_per_month      29791

C. Unique values per categorical variable:

Variable               Unique values
neighbourhood_group    17
room_type              4
city                   19

Performing Operations

In [7]:
HomeSweetHome.isna().sum()
Out[7]:
id                                    0
name                                 16
host_id                               0
host_name                            21
neighbourhood_group               56723
neighbourhood                         0
latitude                              0
longitude                             0
room_type                             0
minimum_nights                        0
number_of_reviews                     0
last_review                       29791
reviews_per_month                 29791
calculated_host_listings_count        0
availability_365                      0
city                                  0
price                                 0
dtype: int64
In [8]:
# Let's check the rows with null values in 'reviews_per_month'

HomeSweetHome[HomeSweetHome['reviews_per_month'].isna()]
Out[8]:
id name host_id host_name neighbourhood_group neighbourhood latitude longitude room_type minimum_nights number_of_reviews last_review reviews_per_month calculated_host_listings_count availability_365 city price
3 61301 Kukui’ula Club Villa 11 198477445.0 Lodge Kauai Koloa-Poipu 21.88471 -159.48359 Entire home/apt 1.0 0.0 NaN NaN 19.0 358.0 Hawaii 2399.0
8 47667 Beautiful, unique 3 bedroom, 3.5 bath home loc... 225117269.0 Hank NaN Highland 39.75848 -105.01377 Entire home/apt 3.0 0.0 NaN NaN 1.0 336.0 Denver 299.0
11 159356 Cozy Room at Roosvelt Island - Full Bed 16909509.0 Lu Manhattan Roosevelt Island 40.76440 -73.94666 Private room 5.0 0.0 NaN NaN 1.0 0.0 New York City 70.0
13 88622 Venue/ Hall / Party 85868649.0 Event Space Other Cities Glendale 34.14309 -118.26325 Entire home/apt 1.0 0.0 NaN NaN 2.0 364.0 Los Angeles 700.0
16 159582 OVERSIZED SUN-FLOODED STUDIO IN SUGAR HILL 24583865.0 Rebecca Manhattan Harlem 40.82749 -73.94484 Entire home/apt 30.0 0.0 NaN NaN 1.0 0.0 New York City 95.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
136999 130608 Private 2bdr apt. awesome LES/Chinatown location 6917811.0 Gia Manhattan Civic Center 40.71537 -74.00181 Private room 2.0 0.0 NaN NaN 1.0 0.0 New York City 130.0
137008 67221 Christmas In Maui! 18397245.0 Sherri Maui Kihei-Makena 20.68676 -156.43680 Entire home/apt 4.0 0.0 NaN NaN 1.0 0.0 Hawaii 550.0
137010 16023 The Red Cottage 10907013.0 Samantha NaN Fort Lauderdale 26.11271 -80.16019 Private room 1.0 0.0 NaN NaN 1.0 365.0 Broward County 150.0
137017 110268 Music City Villa in East Nashville sleeps 10 35100052.0 Victor NaN District 7 36.20469 -86.73784 Entire home/apt 2.0 0.0 NaN NaN 4.0 337.0 Nashville 259.0
137019 103694 FANTASTIC 2BR/2BA! POOL, CLOSE TO ATTRACTIONS 174792040.0 RoomPicks By Victoria City of Los Angeles Downtown 34.04472 -118.25665 Private room 1.0 0.0 NaN NaN 14.0 91.0 Los Angeles 464.0

29791 rows × 17 columns

In [16]:
HomeSweetHome[HomeSweetHome['neighbourhood_group'].isna()]['city'].value_counts()
Out[16]:
Broward County    8773
Austin            8432
Clark County      6718
New Orleans       5117
Chicago           5052
Nashville         4892
Portland          3448
Denver            3347
Boston            2663
Oakland           2555
Jersey City       2000
Asheville         1641
Columbus          1112
Cambridge          828
Pacific Grove      145
Name: city, dtype: int64
In [9]:
HomeSweetHome["neighbourhood_group"].value_counts(dropna=False)
Out[9]:
NaN                     56723
Manhattan               16096
Brooklyn                14615
City of Los Angeles     14092
Other Cities             9126
Maui                     6361
Honolulu                 5025
Queens                   4584
Hawaii                   3990
Kauai                    2588
Unincorporated Areas     2035
Bronx                     962
Newport                   264
Staten Island             262
Washington                139
Providence                122
Kent                       22
Bristol                    17
Name: neighbourhood_group, dtype: int64

Observation:

  1. The null values in 'last_review' and 'reviews_per_month' occur in listings where 'number_of_reviews' is zero.
  2. Null values in 'reviews_per_month' can be filled with 0.
  3. Missing 'neighbourhood_group' values will be filled with the corresponding 'city'.
  4. 'last_review' contains a date and may not be valuable for predicting price when training the model, so it can be dropped.
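
The first observation can be checked directly rather than read off the sample rows. The sketch below uses a small made-up frame with the same column names. On the real data the same comparison would be run on HomeSweetHome.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the relevant columns (values are made up)
toy = pd.DataFrame({
    "number_of_reviews": [17, 0, 4, 0],
    "last_review": ["18/09/19", np.nan, "29/03/20", np.nan],
    "reviews_per_month": [1.04, np.nan, 0.37, np.nan],
})

# Rows where reviews_per_month is missing
missing = toy["reviews_per_month"].isna()

# The observation holds if the missing rows are exactly the zero-review rows
same_rows = bool((missing == toy["number_of_reviews"].eq(0)).all())
print("Missing only where number_of_reviews == 0:", same_rows)

# With that confirmed, 0 is a natural fill value
toy["reviews_per_month"] = toy["reviews_per_month"].fillna(0)
```
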
In [24]:
# Listings without reviews get 0 reviews per month
HomeSweetHome["reviews_per_month"] = HomeSweetHome["reviews_per_month"].fillna(0)
# Fall back to the city where neighbourhood_group is missing
HomeSweetHome["neighbourhood_group"] = HomeSweetHome["neighbourhood_group"].fillna(HomeSweetHome["city"])
In [26]:
print(HomeSweetHome.shape)
(137023, 17)
In [29]:
HomeSweetHome[HomeSweetHome.duplicated()]
Out[29]:
id name host_id host_name neighbourhood_group neighbourhood latitude longitude room_type minimum_nights number_of_reviews last_review reviews_per_month calculated_host_listings_count availability_365 city price


6. Exploratory Data Analysis¶


6.1 Univariate Analysis¶

In [205]:
HomeSweetHome.city.value_counts()
Out[205]:
New York City     36519
Los Angeles       25253
Hawaii            17964
Broward County     8773
Austin             8432
Clark County       6718
New Orleans        5117
Chicago            5052
Nashville          4892
Portland           3448
Denver             3347
Boston             2663
Oakland            2555
Jersey City        2000
Asheville          1641
Columbus           1112
Cambridge           828
Rhode Island        564
Pacific Grove       145
Name: city, dtype: int64
In [40]:
# to check distribution of 'neighbourhood_group'

px.histogram(HomeSweetHome, x= 'neighbourhood_group', marginal='box',
                nbins=47, title='Distribution of neighbourhood_group')
In [45]:
# to check distribution of 'room_type'

px.histogram(HomeSweetHome, x= 'room_type', marginal='box',
                nbins=47, title='Distribution of room_type')
In [59]:
# to check distribution of 'minimum_nights'

plt.figure(figsize=(15,3))
sns.distplot(HomeSweetHome['minimum_nights'])
Out[59]:
<AxesSubplot:xlabel='minimum_nights', ylabel='Density'>
In [60]:
# to check distribution of 'number_of_reviews'

plt.figure(figsize=(15,3))
sns.distplot(HomeSweetHome['number_of_reviews'])
Out[60]:
<AxesSubplot:xlabel='number_of_reviews', ylabel='Density'>
In [61]:
# to check distribution of 'calculated_host_listings_count'

plt.figure(figsize=(15,3))
sns.distplot(HomeSweetHome['calculated_host_listings_count'])
Out[61]:
<AxesSubplot:xlabel='calculated_host_listings_count', ylabel='Density'>
In [64]:
# to check distribution of 'availability_365'

plt.figure(figsize=(15,3))
sns.distplot(HomeSweetHome['availability_365'])
Out[64]:
<AxesSubplot:xlabel='availability_365', ylabel='Density'>
In [65]:
px.histogram(HomeSweetHome, x= 'availability_365', marginal='box',
                nbins=47, title='Distribution of availability_365')
In [107]:
px.histogram(HomeSweetHome, x= 'price', marginal='box',
                nbins=47, title='Distribution of price')
In [109]:
px.histogram(HomeSweetHome[HomeSweetHome['price']<1000], x= 'price', marginal='box',
                nbins=47, title='Distribution of price (price < 1000)')

Observation:

  1. The 'Manhattan', 'Brooklyn' and 'City of Los Angeles' neighbourhood groups each have more than 10,000 listings.
  2. More than 80,000 'Entire home/apt' properties are listed.
  3. 'minimum_nights', 'number_of_reviews' and 'calculated_host_listings_count' are strongly right-skewed and have a large number of outliers.
  4. 'availability_365' has low skew but is not normally distributed: a quarter of the listings have an availability of 1 day or less.
  5. Most prices fall in the range 0 to 400.
  6. Zero values are present in 'number_of_reviews', 'availability_365' and 'price'; 'minimum_nights' and 'calculated_host_listings_count' have a minimum of 1.

6.2 Bivariate analysis¶

In [68]:
px.scatter(HomeSweetHome,x='number_of_reviews',y='reviews_per_month')
In [69]:
px.scatter(HomeSweetHome,x='calculated_host_listings_count',y='price')
In [75]:
px.scatter(HomeSweetHome,y='calculated_host_listings_count',x='availability_365')
In [70]:
px.scatter(HomeSweetHome,x='reviews_per_month',y='price')
In [83]:
plt.figure(figsize=(6,4))
sns.barplot(data=HomeSweetHome,x='room_type',y='price')
Out[83]:
<AxesSubplot:xlabel='room_type', ylabel='price'>

6.3 Multivariate analysis¶

In [77]:
plt.figure(figsize=(8,8))
sns.heatmap(HomeSweetHome.select_dtypes(include='number').corr(),annot=True,cmap='icefire')
Out[77]:
<AxesSubplot:>
In [78]:
HomeSweetHome.select_dtypes(include='number').skew()
Out[78]:
id                                 0.001497
host_id                            1.051471
latitude                          -0.725820
longitude                         -0.740660
minimum_nights                    15.694661
number_of_reviews                  3.702934
reviews_per_month                  2.677926
calculated_host_listings_count     6.388123
availability_365                   0.182854
price                             20.790929
dtype: float64

Observation:

  1. There is a slight relationship between 'reviews_per_month' and 'number_of_reviews'.
  2. 'Hotel room' listings are priced higher on average than the other room types.
  3. The higher the price, the fewer the reviews.
  4. 'minimum_nights', 'number_of_reviews', 'reviews_per_month', 'calculated_host_listings_count' and 'price' are highly positively skewed.
  5. 'id' is correlated with 'latitude' and 'longitude'.
  6. 'latitude' is correlated with 'longitude'.
  7. 'reviews_per_month' is correlated with 'number_of_reviews'.
  8. 'calculated_host_listings_count' is correlated with 'availability_365'.
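
A log transform is one common way to reduce strong positive skew. The sketch below shows the effect of np.log1p on a synthetic right-skewed series. The data is randomly generated and only stands in for a column such as number_of_reviews, and transforming the features this way is an assumption for illustration, not a step taken in this notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic right-skewed counts (lognormal draws rounded to whole numbers)
counts = pd.Series(rng.lognormal(mean=2.0, sigma=1.0, size=5000)).round()

skew_before = counts.skew()
skew_after = np.log1p(counts).skew()   # log1p handles zero counts safely

print("skew before:", skew_before)
print("skew after log1p:", skew_after)
```

np.log1p computes log(1 + x), so zero counts map to 0 instead of negative infinity.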


7. Data Post-Processing¶


7.1 Feature Extraction¶

Categorical & Continuous Variable Split

In [84]:
HomeSweetHome.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 137023 entries, 0 to 137022
Data columns (total 17 columns):
 #   Column                          Non-Null Count   Dtype  
---  ------                          --------------   -----  
 0   id                              137023 non-null  int64  
 1   name                            137007 non-null  object 
 2   host_id                         137023 non-null  float64
 3   host_name                       137002 non-null  object 
 4   neighbourhood_group             137023 non-null  object 
 5   neighbourhood                   137023 non-null  object 
 6   latitude                        137023 non-null  float64
 7   longitude                       137023 non-null  float64
 8   room_type                       137023 non-null  object 
 9   minimum_nights                  137023 non-null  float64
 10  number_of_reviews               137023 non-null  float64
 11  last_review                     107232 non-null  object 
 12  reviews_per_month               137023 non-null  float64
 13  calculated_host_listings_count  137023 non-null  float64
 14  availability_365                137023 non-null  float64
 15  city                            137023 non-null  object 
 16  price                           137023 non-null  float64
dtypes: float64(9), int64(1), object(7)
memory usage: 17.8+ MB
In [99]:
HomeSweetHome_copy=HomeSweetHome.copy()
In [100]:
# Cast the count-like columns to integers (this truncates fractional values such as reviews_per_month)
int_cols = ['minimum_nights', 'number_of_reviews', 'reviews_per_month',
            'calculated_host_listings_count', 'availability_365', 'price']
HomeSweetHome_copy[int_cols] = HomeSweetHome_copy[int_cols].astype('int')
In [101]:
HomeSweetHome_copy.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 137023 entries, 0 to 137022
Data columns (total 17 columns):
 #   Column                          Non-Null Count   Dtype  
---  ------                          --------------   -----  
 0   id                              137023 non-null  int64  
 1   name                            137007 non-null  object 
 2   host_id                         137023 non-null  float64
 3   host_name                       137002 non-null  object 
 4   neighbourhood_group             137023 non-null  object 
 5   neighbourhood                   137023 non-null  object 
 6   latitude                        137023 non-null  float64
 7   longitude                       137023 non-null  float64
 8   room_type                       137023 non-null  object 
 9   minimum_nights                  137023 non-null  int32  
 10  number_of_reviews               137023 non-null  int32  
 11  last_review                     107232 non-null  object 
 12  reviews_per_month               137023 non-null  int32  
 13  calculated_host_listings_count  137023 non-null  int32  
 14  availability_365                137023 non-null  int32  
 15  city                            137023 non-null  object 
 16  price                           137023 non-null  int32  
dtypes: float64(3), int32(6), int64(1), object(7)
memory usage: 14.6+ MB
In [105]:
# Keep only the rows with a positive price

HomeSweetHome_copy=HomeSweetHome_copy[HomeSweetHome_copy['price']>0]
In [106]:
HomeSweetHome_copy.shape
Out[106]:
(136981, 17)
In [112]:
# Keep only the rows with a price below 400
HomeSweetHome_copy=HomeSweetHome_copy[HomeSweetHome_copy['price']<400]
HomeSweetHome_copy.shape
Out[112]:
(125699, 17)
In [113]:
df_cont = HomeSweetHome_copy.select_dtypes(exclude='object')

df_cat = HomeSweetHome_copy.select_dtypes(include='object')
In [114]:
df_cat.head()
Out[114]:
name host_name neighbourhood_group neighbourhood room_type last_review city
0 Private bedroom located in Downtown Manhattan Sandra And Katharina Manhattan Chinatown Private room 18/09/19 New York City
1 Quiet, Comfy West LA Cottage - HSR 19-000047 James City of Los Angeles Mar Vista Entire home/apt 07/09/20 Los Angeles
2 Home away from Home ! Kahee Other Cities Pasadena Entire home/apt 29/03/20 Los Angeles
4 One bedroom apartment Kostas Queens Astoria Entire home/apt 31/08/20 New York City
5 10th St Renovated Beauty- Rm 3U Scott Oakland Prescott Private room 31/05/19 Oakland
In [115]:
df_cont.head()
Out[115]:
id host_id latitude longitude minimum_nights number_of_reviews reviews_per_month calculated_host_listings_count availability_365 price
0 149653 257599351.0 40.71703 -73.99538 2 17 1 1 0 100
1 74702 2882551.0 34.01257 -118.44254 2 331 4 1 170 102
2 95858 287662307.0 34.14255 -118.09888 10 4 0 1 306 131
4 132101 87835557.0 40.76623 -73.90911 2 189 3 1 331 76
5 163184 1286670.0 37.81055 -122.29792 7 16 0 5 0 59

Dropping the unnecessary features

In [116]:
# 'name', 'host_name', 'neighbourhood', 'last_review' can be dropped from Categorical df
df_cat=df_cat.drop(['name', 'host_name', 'neighbourhood', 'last_review'], axis=1)

# 'id','host_id','reviews_per_month' can be dropped from Continuous df
df_cont=df_cont.drop(['id','host_id','reviews_per_month'],axis=1)
In [118]:
df_cont.head()
Out[118]:
latitude longitude minimum_nights number_of_reviews calculated_host_listings_count availability_365 price
0 40.71703 -73.99538 2 17 1 0 100
1 34.01257 -118.44254 2 331 1 170 102
2 34.14255 -118.09888 10 4 1 306 131
4 40.76623 -73.90911 2 189 1 331 76
5 37.81055 -122.29792 7 16 5 0 59
In [119]:
df_cat.head()
Out[119]:
neighbourhood_group room_type city
0 Manhattan Private room New York City
1 City of Los Angeles Entire home/apt Los Angeles
2 Other Cities Entire home/apt Los Angeles
4 Queens Entire home/apt New York City
5 Oakland Private room Oakland

7.2 Feature Transformation¶

In [120]:
df_cat = df_cat.apply(LabelEncoder().fit_transform)
df_cat.head()
Out[120]:
neighbourhood_group room_type city
0 18 2 14
1 9 0 11
2 24 0 11
4 28 0 14
5 23 2 15
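
Label encoding assigns each category an integer code, and a linear model then treats those codes as ordered magnitudes. One-hot encoding is a common alternative that avoids this. The sketch below swaps it in with pd.get_dummies on a made-up room_type column. It is not the encoding used in this notebook.

```python
import pandas as pd

# Toy categorical column (values are made up for illustration)
toy_cat = pd.DataFrame({
    "room_type": ["Private room", "Entire home/apt", "Hotel room", "Entire home/apt"],
})

# One indicator column per category; drop_first removes one redundant column
one_hot = pd.get_dummies(toy_cat, columns=["room_type"], drop_first=True, dtype=int)
print(one_hot)
```

With drop_first=True the first category in sorted order ('Entire home/apt' here) becomes the baseline and is represented by all indicator columns being 0.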

Combining both Categorical & Continuous df

In [121]:
df_comb=pd.concat([df_cat, df_cont], axis = 1)
df_comb.head()
Out[121]:
neighbourhood_group room_type city latitude longitude minimum_nights number_of_reviews calculated_host_listings_count availability_365 price
0 18 2 14 40.71703 -73.99538 2 17 1 0 100
1 9 0 11 34.01257 -118.44254 2 331 1 170 102
2 24 0 11 34.14255 -118.09888 10 4 1 306 131
4 28 0 14 40.76623 -73.90911 2 189 1 331 76
5 23 2 15 37.81055 -122.29792 7 16 5 0 59
In [123]:
df_comb.skew()
Out[123]:
neighbourhood_group                0.229196
room_type                          0.788885
city                              -0.685629
latitude                          -0.786414
longitude                         -0.775323
minimum_nights                    15.289802
number_of_reviews                  3.601265
calculated_host_listings_count     6.702582
availability_365                   0.222720
price                              1.081080
dtype: float64

7.3 Feature Independent & Dependent Separation¶

In [124]:
def seperate_Xy(data=None):
    X = data.drop(labels=['price'], axis=1)
    y = data['price']
    return X, y
In [126]:
X, y = seperate_Xy(data=df_comb)
X.head()
Out[126]:
neighbourhood_group room_type city latitude longitude minimum_nights number_of_reviews calculated_host_listings_count availability_365
0 18 2 14 40.71703 -73.99538 2 17 1 0
1 9 0 11 34.01257 -118.44254 2 331 1 170
2 24 0 11 34.14255 -118.09888 10 4 1 306
4 28 0 14 40.76623 -73.90911 2 189 1 331
5 23 2 15 37.81055 -122.29792 7 16 5 0
In [127]:
y.head()
Out[127]:
0    100
1    102
2    131
4     76
5     59
Name: price, dtype: int32

7.3 Feature Scaling¶

  • In this section, we will perform standard scaling over the selected features.
In [128]:
scaler = StandardScaler() 
scaler.fit(X) 

X = pd.DataFrame(scaler.transform(X), index = X.index, columns = X.columns + '_S')
In [129]:
X.head()
Out[129]:
neighbourhood_group_S room_type_S city_S latitude_S longitude_S minimum_nights_S number_of_reviews_S calculated_host_listings_count_S availability_365_S
0 0.582539 1.354976 0.913594 0.838742 0.963337 -0.338781 -0.280974 -0.294627 -1.140292
1 -0.543619 -0.701633 0.236777 -0.124875 -0.648639 -0.338781 4.648905 -0.294627 0.069343
2 1.333311 -0.701633 0.236777 -0.106193 -0.636175 -0.022619 -0.485077 -0.294627 1.037052
4 1.833826 -0.701633 0.913594 0.845813 0.966466 -0.338781 2.419469 -0.294627 1.214939
5 1.208182 1.354976 1.139199 0.421000 -0.788463 -0.141180 -0.296674 -0.218766 -1.140292
In [130]:
# Log-transform the target (price)
y = np.log(y)
y.head()
Out[130]:
0    4.605170
1    4.624973
2    4.875197
4    4.330733
5    4.077537
Name: price, dtype: float64
In [133]:
X.skew()
Out[133]:
neighbourhood_group_S                0.229196
room_type_S                          0.788885
city_S                              -0.685629
latitude_S                          -0.786414
longitude_S                         -0.775323
minimum_nights_S                    15.289802
number_of_reviews_S                  3.601265
calculated_host_listings_count_S     6.702582
availability_365_S                   0.222720
dtype: float64

7.4 Data Preparation for Training & Testing¶

In [134]:
def Xy_splitter(X=None, y=None):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 42)
    print('Training Data Shape:', X_train.shape, y_train.shape)
    print('Testing Data Shape:', X_test.shape, y_test.shape)
    return X_train, X_test, y_train, y_test
In [135]:
X_train, X_test, y_train, y_test = Xy_splitter(X=X, y=y)
Training Data Shape: (94274, 9) (94274,)
Testing Data Shape: (31425, 9) (31425,)
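
In this notebook the StandardScaler is fitted on the full feature matrix before the split. A common alternative, sketched below on synthetic data, is to split first and fit the scaler on the training rows only, so that test-set statistics do not influence the transformation. The array names and random data are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X_toy = rng.normal(loc=50.0, scale=10.0, size=(200, 3))   # synthetic features
y_toy = rng.normal(size=200)                               # synthetic target

# Split first, using the same proportions as the notebook
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.25, random_state=42)

scaler = StandardScaler().fit(X_tr)     # statistics come from the training rows only
X_tr_s = scaler.transform(X_tr)
X_te_s = scaler.transform(X_te)         # test rows reuse the training mean and std

print(X_tr_s.mean(axis=0).round(6))     # approximately 0 for every column
```

The scaled test columns will not have exactly zero mean, which is expected: they are transformed with the training statistics.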


8. Model Development & Evaluation¶


In [136]:
def model_generator_lr():
    return LinearRegression()
In [137]:
clf = model_generator_lr()
In [164]:
def train_n_eval(clf=None):
    
    # Extract the model name
    model_name = type(clf).__name__
    
    # Fit the model on the training data
    clf.fit(X_train, y_train)
    
    # Make predictions on the test data
    y_pred = clf.predict(X_test)
    
    # Make predictions on the training data
    y_pred_train = clf.predict(X_train)
    
    # Mean absolute error on the test and training data
    clf_mae = mean_absolute_error(y_test, y_pred)
    clf_mae_train = mean_absolute_error(y_train, y_pred_train)
    
    # Mean squared error on the test and training data
    clf_mse = mean_squared_error(y_test, y_pred)
    clf_mse_train = mean_squared_error(y_train, y_pred_train)
    
    # R-squared score on the test and training data
    clf_r2 = r2_score(y_test, y_pred)
    clf_r2_train = r2_score(y_train, y_pred_train)
    
    # Display the performance metrics (all computed on the log-transformed price)
    print('Performance Metrics for', model_name, ':\n')
    print('[Mean Absolute Error Train]:', clf_mae_train)
    print('[Mean Absolute Error Test]:', clf_mae, '\n')
    print('*******************************\n')
    print('[Mean Squared Error Train]:', clf_mse_train)
    print('[Mean Squared Error Test]:', clf_mse, '\n')
    print('*******************************\n')
    print('[Root Mean Squared Error Train]:', np.sqrt(clf_mse_train))
    print('[Root Mean Squared Error Test]:', np.sqrt(clf_mse), '\n')
    print('*******************************\n')
    print('[R2-Score Train]:', clf_r2_train)
    print('[R2-Score Test]:', clf_r2)
In [165]:
train_n_eval(clf=clf)
Performance Metrics for LinearRegression :

[Mean Absolute Error Train]: 0.40754035769661023
[Mean Absolute Error Test]: 0.4048206163822222

*******************************

[Mean Squared Error Train]: 0.25813454531286567
[Mean Squared Error Test]: 0.25564061718755987

*******************************

[Root Mean Squared Error Train]: 0.5080694296184978
[Root Mean Squared Error Test]: 0.5056091545725412

*******************************

[R2-Score Train]: 0.35258113777827904
[R2-Score Test]: 0.3548097878310106

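Observation:

  • The train and test scores are nearly identical (R2 of about 0.353 and 0.355), so the model does not appear to overfit.
  • An R2 of about 0.35 means the linear model explains only around a third of the variance in log-price, which points to underfitting.
  • All error values are in log-price units, because y was log-transformed before training.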
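Since the errors reported above are on the log scale, they can be hard to read as prices. Loosely, a log-scale MAE of about 0.40 corresponds to a typical multiplicative error of about exp(0.40), or roughly 1.5 times the true price. The helper below, `price_scale_mae`, is a hypothetical name introduced for this sketch; it back-transforms both arrays before computing the MAE and is demonstrated on made-up values rather than on the project data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Test MAE on the log scale, copied from the output above
mae_log_test = 0.4048206163822222

# Rough multiplicative reading of that error
typical_factor = np.exp(mae_log_test)


def price_scale_mae(y_true_log, y_pred_log):
    """MAE in price units, given log-scale targets and predictions."""
    return mean_absolute_error(np.exp(y_true_log), np.exp(y_pred_log))


# Made-up log-prices, for illustration only
y_true_log = np.log(np.array([100.0, 80.0, 150.0]))
y_pred_log = np.log(np.array([120.0, 80.0, 100.0]))

example_mae = price_scale_mae(y_true_log, y_pred_log)
print(typical_factor, example_mae)
```

In the notebook, the same helper could be called as `price_scale_mae(y_test, clf.predict(X_test))`, assuming those variables are still in scope.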

Implementing Model on provided data¶

In [141]:
df_test=pd.read_csv('C:/Users/Mahesh/Downloads/Python/Term Projects/ML_Intermediate/rentel price/test_data.csv')
df_test.head()
Out[141]:
id name host_id host_name neighbourhood_group neighbourhood latitude longitude room_type minimum_nights number_of_reviews last_review reviews_per_month calculated_host_listings_count availability_365 city
0 49653 Beachfront Bungalow & Home of Honu 24843776.0 Marco Maui Lahaina 20.95924 -156.68334 Entire home/apt 3.0 79.0 25/03/20 1.18 1.0 275.0 Hawaii
1 128272 Resort-like living in Williamsburg 14461742.0 Mohammed Brooklyn Williamsburg 40.71552 -73.93869 Entire home/apt 5.0 1.0 01/01/16 0.02 1.0 0.0 New York City
2 88753 Los Angeles Luxury Apartment in Downtown LA 205959517.0 Christian City of Los Angeles Downtown 34.04331 -118.25804 Entire home/apt 30.0 0.0 NaN NaN 1.0 354.0 Los Angeles
3 151475 E Z Living In Harlem 2 244536777.0 Chester Manhattan Harlem 40.81615 -73.94359 Entire home/apt 3.0 4.0 01/12/19 0.30 2.0 180.0 New York City
4 134525 Cozy, neat, spacious 2 BR apartment in East Ha... 2247818.0 Gonda Manhattan East Harlem 40.80268 -73.94051 Entire home/apt 10.0 33.0 13/03/20 0.77 1.0 10.0 New York City
In [167]:
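# Fill missing review rates with 0; fall back to the city when the neighbourhood group is missing
df_test["reviews_per_month"] = df_test["reviews_per_month"].fillna(0)
df_test["neighbourhood_group"] = df_test["neighbourhood_group"].fillna(df_test["city"])

# Cast count-like columns to integers (this truncates reviews_per_month, which is dropped below)
int_cols = ['minimum_nights', 'number_of_reviews', 'reviews_per_month',
            'calculated_host_listings_count', 'availability_365']
df_test[int_cols] = df_test[int_cols].astype('int')

# Split into numeric and categorical columns
df_test_cont = df_test.select_dtypes(exclude='object')
df_test_cat = df_test.select_dtypes(include='object')

# Drop identifiers and columns that are not used as features
df_test_cat = df_test_cat.drop(['name', 'host_name', 'neighbourhood', 'last_review'], axis=1)
df_test_cont = df_test_cont.drop(['id', 'host_id', 'reviews_per_month'], axis=1)

# Label-encode the categorical columns.
# NOTE: the encoders are fitted on the test data here, so the integer codes match
# the training codes only if both sets contain the same categories.
df_test_cat = df_test_cat.apply(LabelEncoder().fit_transform)

df_test_comb = pd.concat([df_test_cat, df_test_cont], axis=1)

# NOTE: this re-fits the scaler on the test data; calling only scaler.transform
# with a scaler fitted on the training features would keep the scale consistent.
scaler.fit(df_test_comb)

df_test_comb = pd.DataFrame(scaler.transform(df_test_comb), index=df_test_comb.index, columns=df_test_comb.columns + '_S')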
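The cell above fits new encoders and re-fits the scaler on the test data. A more conventional pattern is to fit every transformer on the training data only and then apply `transform` to the test data, so that the model sees features encoded and scaled the same way as during training. The sketch below shows that pattern on two tiny made-up frames; `train`, `test`, `encode`, and `scaler_sketch` are hypothetical names, and `OrdinalEncoder` is swapped in for `LabelEncoder` because it can map categories unseen in training to a sentinel value.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Made-up miniature train/test frames (not the project data)
train = pd.DataFrame({
    "room_type": ["Entire home/apt", "Private room", "Shared room", "Private room"],
    "city": ["Hawaii", "New York City", "Los Angeles", "New York City"],
    "minimum_nights": [3, 5, 30, 2],
})
test = pd.DataFrame({
    "room_type": ["Private room", "Hotel room"],  # "Hotel room" is unseen in train
    "city": ["New York City", "Hawaii"],
    "minimum_nights": [10, 1],
})

cat_cols = ["room_type", "city"]
num_cols = ["minimum_nights"]

# Unseen categories are encoded as -1 instead of raising an error
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
scaler_sketch = StandardScaler()


def encode(frame, fit=False):
    """Encode categorical columns, fitting the encoder only when fit=True."""
    if fit:
        codes = encoder.fit_transform(frame[cat_cols])
    else:
        codes = encoder.transform(frame[cat_cols])
    encoded = pd.DataFrame(codes, columns=cat_cols, index=frame.index)
    return pd.concat([encoded, frame[num_cols]], axis=1)


# Fit on the training data only
train_enc = encode(train, fit=True)
train_scaled = scaler_sketch.fit_transform(train_enc)

# Reuse the fitted encoder and scaler on the test data
test_enc = encode(test)
test_scaled = scaler_sketch.transform(test_enc)

print(test_enc)
```

Applying this to the notebook would change the predictions below, so the recorded outputs correspond to the original cell, not to this sketch.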
In [170]:
df_test.head()
Out[170]:
id name host_id host_name neighbourhood_group neighbourhood latitude longitude room_type minimum_nights number_of_reviews last_review reviews_per_month calculated_host_listings_count availability_365 city
0 49653 Beachfront Bungalow & Home of Honu 24843776.0 Marco Maui Lahaina 20.95924 -156.68334 Entire home/apt 3 79 25/03/20 1 1 275 Hawaii
1 128272 Resort-like living in Williamsburg 14461742.0 Mohammed Brooklyn Williamsburg 40.71552 -73.93869 Entire home/apt 5 1 01/01/16 0 1 0 New York City
2 88753 Los Angeles Luxury Apartment in Downtown LA 205959517.0 Christian City of Los Angeles Downtown 34.04331 -118.25804 Entire home/apt 30 0 NaN 0 1 354 Los Angeles
3 151475 E Z Living In Harlem 2 244536777.0 Chester Manhattan Harlem 40.81615 -73.94359 Entire home/apt 3 4 01/12/19 0 2 180 New York City
4 134525 Cozy, neat, spacious 2 BR apartment in East Ha... 2247818.0 Gonda Manhattan East Harlem 40.80268 -73.94051 Entire home/apt 10 33 13/03/20 0 1 10 New York City
In [186]:
# Make predictions using test data
y_test_pred = clf.predict(df_test_comb)
print(np.round(np.exp(y_test_pred),decimals=0))
[151. 133. 147. ...  71. 138. 161.]
In [199]:
index=df_test['id']
result=pd.DataFrame(np.round(np.exp(y_test_pred),decimals=0),index=index)
In [204]:
result.to_csv('C:/Users/Mahesh/Downloads/Python/Term Projects/ML_Intermediate/rentel price/result.csv', header=False)
print("Successful")
Successful
In [202]:
result
Out[202]:
0
id
49653 151.0
128272 133.0
88753 147.0
151475 142.0
134525 130.0
... ...
161801 65.0
28466 134.0
165580 71.0
40826 138.0
49668 161.0

34257 rows × 1 columns
